Fritz: All Fritz

Text File | 1990-01-15 | 19KB | 469 lines

  PROFESSIONAL OPTICAL CHARACTER RECOGNITION - PRO-CR<tm>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copyright 1989, David P. Gray, Gray Design Associates
All Rights Reserved
Member, Association of Shareware Professionals

-----------------------------[ C O N T E N T S ]-----------------------------

  1.  Specification.
  2.  System Requirements.
  3.  Files Distributed.
  4.  Revision History.
  5.  Future Versions.
  6.  USER GUIDE
      6.1   Start-Up Procedure.
      6.2   Feeding Input to PRO-CR<tm>.
      6.3   Font Selection.
      6.4   Non HP ScanJet Users.
      6.5   Output Text File.
      6.6   Performance.
      6.7   Theory of Operation.
      6.8   Menus.
      6.9   Learn Mode.
      6.10  Edit Mode.
      6.11  Error Messages.
  7.  Site Licenses.
  8.  Comments to the Author.
  9.  Association of Shareware Professionals.
  10. Miscellaneous.

----------------------------[ 1. SPECIFICATION ]----------------------------

  * Reads 8 to 30 point mono and proportional fonts.
  * Up to 200 words per minute.
  * Supports HP ScanJet or any scanner that supports TIFF files
    (not suitable for hand-scanners).
  * Training and font editing supported with EGA or VGA adapter.
  * Real-time viewing of text during normal and training scan.
  * Continuous scanning if auto document feeder attached.
  * Upgrade to Version 2 when available.

-------------------------[ 2. SYSTEM REQUIREMENTS ]-------------------------

PRO-CR<tm> performs Optical Character Recognition on an IBM PC or
compatible. The program will run on an XT or AT; however, an AT is strongly
recommended because the program is highly CPU-intensive.

A graphics adapter is not required for basic character recognition, but is
needed for the training and font-edit functions. If a graphics adapter is
used, it should be EGA or VGA (CGA does not have the required resolution).

The minimum memory requirement is about 80Kb (512Kb is recommended),
although the program adapts itself to use as much conventional memory as is
available. Version 1 does not support expanded or enhanced memory.
A temporary disk-file is used for any parts of the scanned image that will
not fit into memory at once.

--------------------------[ 3. FILES DISTRIBUTED ]--------------------------

  OCR.EXE      : The PRO-CR<tm> program
  README.DOC   : Important information
  HELP1.DOC    : Text file used for online help
  HELP2.DOC    : Text file used for online help
  HELP3.DOC    : Text file used for online help
  MANUAL.DOC   : This file
  COURIER.OCR  : Font file
  ROMAN.OCR    : Font file
  HELV.OCR     : Font file
  IMAGE.TIF    : Example TIFF file for processing

NOTE: The text in the IMAGE.TIF file is in Courier.

--------------------------[ 4. REVISION HISTORY ]---------------------------

  1.0   05/16/89 : Baseline version.
  1.01  05/18/89 : Fixed character editing in font edit function; the
                   fault was caused by a bug in the compiler's loop
                   optimizer.
  1.02  05/31/89 : Don't reject TIFFs with no bits_per_sample tag.
                   Assume a value of 1.
  1.03  06/19/89 : Don't reject TIFFs with no resolution tags.
  1.04  08/28/89 : Fixed bug in learn-mode.
  1.05  11/29/89 : Fixed bug in Auto sheet feeder control.

----------------------------[ 5. FUTURE VERSIONS ]---------------------------

Version 2.0 is currently in progress. Estimated shipping date is first
quarter of 1990. The following is a list of features expected to be
included:

  * Enhanced speed and recognition rate.
  * Font independence (just hit the start button).
  * Mouse support. Selection of areas to be scanned.
  * Mixed text and graphics blocks for desktop publishers.
  * Direct support of Logitech hand-scanner.
  * Ability to handle compressed TIFFs, PCX and MSP formats.

-------------------------[ 6.  U S E R   G U I D E ]------------------------

-------------------------[ 6.1 START-UP PROCEDURE ]-------------------------

From the DOS prompt, type:

    ocr

-----------------------[ 6.2 FEEDING INPUT TO PRO-CR ]----------------------

There are 2 methods of supplying input to PRO-CR<tm>.

  1. Direct scanning from an HP ScanJet.
  2. Reading from a TIFF file produced by any other scanner.
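
For the second method, it can be useful to check a TIFF file before handing
it to PRO-CR<tm>. The following Python sketch is not part of PRO-CR<tm>. It
is an illustration only: it reads the standard TIFF header and the first
image directory and reports two of the conditions listed under the error
messages in section 6.11, non-Intel byte order and compression. The function
name check_tiff and the wording of its messages are made up for this example.

```python
import struct

COMPRESSION_TAG = 259  # standard TIFF "Compression" tag; 1 means uncompressed


def check_tiff(data):
    """Return a list of problems found in the raw bytes of a TIFF file.

    Hypothetical helper for illustration; it does not reproduce the
    checks PRO-CR itself performs.
    """
    if len(data) < 8 or data[:2] not in (b"II", b"MM"):
        return ["not a TIFF file"]
    if data[:2] == b"MM":
        # "II" marks Intel (little-endian) order, "MM" Motorola order.
        return ["non-Intel byte order"]

    # Bytes 2-3 hold the magic number 42, bytes 4-7 the offset of the
    # first image file directory (IFD).
    magic, ifd_offset = struct.unpack_from("<HI", data, 2)
    if magic != 42:
        return ["not a TIFF file"]

    problems = []
    (entry_count,) = struct.unpack_from("<H", data, ifd_offset)
    for i in range(entry_count):
        # Each IFD entry is 12 bytes: tag, field type, count, value/offset.
        entry = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from("<HHI", data, entry)
        if tag == COMPRESSION_TAG and field_type == 3 and count == 1:
            # A single SHORT value is stored in the entry itself.
            (value,) = struct.unpack_from("<H", data, entry + 8)
            if value != 1:
                problems.append("compressed TIFF file")
    return problems
```

Calling check_tiff(open("IMAGE.TIF", "rb").read()) on the file to be
processed returns an empty list when neither problem is found. A file with
no compression tag at all is treated as uncompressed, which is the TIFF
default.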

Direct scanning lets you scan a single page on a plain flat-bed scanner, or
multiple pages if an automatic document feeder is attached. Version 1
always scans entire pages.

PRO-CR<tm> recognizes both mono-spaced and proportionally spaced fonts. It
adjusts automatically to character size and will automatically switch fonts
when more than one is selected.

PRO-CR<tm> is trainable. A learning mode is provided to learn unrecognized
shapes or new fonts.

---------------------------[ 6.3 FONT SELECTION ]---------------------------

PRO-CR<tm> provides a number of standard fonts for selection. More than one
font may be selected in cases where you are not sure what font is on the
page to be processed, or if there is more than one font on the page.

For cases where only one font appears on the page to be scanned, selecting
this font will generally give more accurate results and faster times than
selecting all the fonts. However, the penalty for selecting all fonts is
not great and is probably the best thing to do if you are in any doubt.

If you are not sure what a particular font looks like, use the font editing
feature to see the shapes of the default supplied fonts. (See section 6.10,
Edit Mode.)

------------------------[ 6.4 NON HP SCANJET USERS ]------------------------

Compatibility with non-HP scanners is made possible through the use of TIFF
(tag image file format) files. Many scanners and desktop publishing
programs use this standard file format.

A resolution of 300 dots per inch gives a good compromise between accuracy
and processing time. If the text you are scanning is large, over 12 points,
you may wish to scan at a lower resolution, say 240 or 200 dpi, to speed
processing in PRO-CR<tm>. In general, though, the higher the resolution the
better the accuracy.
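
The trade-off between point size and resolution can be worked through with a
little arithmetic. The Python sketch below is an illustration only and is not
part of PRO-CR<tm>. It takes one point as 1/72 inch and treats the nominal
point size as the height of a line of type, which is an approximation. The
function name is made up for this example.

```python
POINTS_PER_INCH = 72  # assumption: one typographic point is 1/72 inch


def point_size_in_pixels(points, dpi):
    """Nominal height, in scan pixels, of type of the given point size."""
    return points * dpi / POINTS_PER_INCH


# Compare a few point sizes at the resolutions mentioned in the text.
for points, dpi in [(8, 300), (12, 300), (12, 240), (12, 200), (30, 300)]:
    height = point_size_in_pixels(points, dpi)
    print(f"{points:2d} pt at {dpi} dpi: {height:6.1f} pixels")
```

On this estimate, 12 point type scanned at 200 dpi spans the same number of
pixels as 8 point type scanned at 300 dpi, so larger text keeps a usable
amount of detail at the lower resolution while giving the program fewer
pixels to process.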

When reading from a TIFF file, PRO-CR<tm> looks for the file IMAGE.TIF.

Even if you do have an HP ScanJet, you can still use file mode for cases
when you do not wish to scan the whole page. Use the scanning program that
came with the scanner to scan the part of the page containing the text you
wish to process.

Version 1 of PRO-CR<tm> does not read compressed TIFF files.

--------------------------[ 6.5 OUTPUT TEXT FILE ]--------------------------

Whether scanning direct or reading from the TIFF file, all processed output
is directed to a plain ASCII text file, default name TEXT.SAV. Version 1
does not support word processor attributes or file formats.

You may change the name of the output file in the "run" menu. Text is
always appended to the file for each page scanned until you choose a new
file name.

-----------------------------[ 6.6 PERFORMANCE ]----------------------------

 6.6.1 Font Size
~~~~~~~~~~~~~~~~~
PRO-CR<tm> automatically adjusts itself to a range of point sizes within
any document. The range is approximately 8 to 30 points. The low end
depends on the quality of the document and the typeface used. These figures
assume the image was scanned at 300 dots per inch (the resolution used for
the direct scanning mode).

Learning mode allows a total of 12 fonts, with up to 90 shapes in each
font. Up to 3 fonts may be selected simultaneously in learning or
non-learning mode.

 6.6.2 Processing Speed
~~~~~~~~~~~~~~~~~~~~~~~~
With one font selected, PRO-CR<tm> will process text at approximately 200
words per minute on a 20MHz 386 PC.

 6.6.3 Error Rate
~~~~~~~~~~~~~~~~~~
The error rate is dependent on the quality of the text being processed and
on the number of characters that "run together". In general the mono-spaced
fonts such as Courier are easiest and the Roman font is the hardest to
recognize accurately. For cases where characters run together, the learning
mode can be used to help recognition.
With good quality type the recognition accuracy is approximately 95% to 99%
for Courier and 90% to 95% for Roman and Helvetica.

-------------------------[ 6.7 THEORY OF OPERATION ]------------------------

 6.7.1 PRO-CR<tm> Uses Feature Extraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PRO-CR<tm> uses feature extraction, with one proprietary global
(topological) feature and two local features. The local features are
optimized for the three supplied fonts. With all three fonts selected, good
recognition is achieved on other non-stylized fonts via this combination of
features. PRO-CR<tm> also includes a large number of ad-hoc positional and
context sensitive tests.

 6.7.2 Single Character Errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PRO-CR<tm> will not correctly recognize single characters 100% of the time.
For every character guessed wrong, the reason is usually to be found on the
document: broken characters, skewed lines, misplaced text and smudges, to
name a few. Sometimes it is just bad luck. (For the technically minded,
every signal processing system involves some noise. In this case the noise
is in the scanning conversion and is a function of the scan resolution.
Some characters look very much alike (S vs 5, b vs h), and one pixel
dropped from the wrong place and appearing in another place can cause
mis-recognition.)

 6.7.3 Run-Ons
~~~~~~~~~~~~~~~
One of the biggest problems faced by Optical Character Recognizers is
run-ons. The ultimate run-on is human handwriting, in which all the
characters are joined together. This kind of recognition is beyond the
scope of most PC-based OCRs, including this one. The limited handling of
run-ons that PRO-CR<tm> does provide is as follows.

PRO-CR<tm> recognizes mono-spaced fonts such as Courier and proportional
fonts such as Helvetica and Roman. In tightly spaced proportional fonts
many of the characters run into each other. (This can also happen in badly
spaced mono-spaced fonts.) The software only recognizes single objects and
so gets very confused by characters joined together in this way.
It tries to split such run-ons so as to recognize them as two characters,
but will often still fail. Where three characters are joined together it is
almost certain to fail.

The run-ons are rather dependent on the particular printer which printed
the page, and for this reason a learning mode is provided. This allows for
learning unique shapes applicable to a particular document. It also
provides a mechanism to learn a completely new font. Bear in mind that when
learning a new font, best results are obtained with good clean type, 10
points or larger. Don't try working with any kind of script font where all
characters flow together; you won't get very far!

--------------------------------[ 6.8 MENUS ]-------------------------------

The operator interface is implemented as a series of menu levels.
Completion of one takes you to the next; selecting QUIT takes you to the
previous (or back to DOS if at the first menu). The following is a list of
menus and selections.

MENU 1. (Select mode:)

    Select "Scan_mode" if you will be scanning direct, or
    select "File_mode" if you will be reading from your IMAGE.TIF file.

MENU 2. (Select font(s):)

    Select one or more fonts to be used when performing the OCR. The
    selected fonts are indicated by a check mark. Select the OK option to
    get to the run menu, menu 3.

    Menus 2a and 2b are for use when learning or editing a font. You can
    skip these menus during normal use and select OK to proceed to menu 3.

MENU 3. (Run:)

    Select FILE NAME if you wish to change the name of the file to which
    the processed text will be written. The default file is TEXT.SAV. This
    is a plain ASCII text file which may be imported into any word
    processor or desktop publishing program. If more than one scan is done,
    new information is appended to this file until you select a new file
    name.

    Select START for a single page scan, or
    select AUTO FEED for a multi page scan.
    AUTO FEED is only available when scanning directly; file mode will
    process everything in the file. Also, an automatic document feeder must
    be present and ready for use.

-----------------------------[ 6.9 LEARN MODE ]-----------------------------

MENU 2a. (Select font for learning:)

    Select the Learn option to select a font for learning or to add a brand
    new font for learning. Only one font can be learnt at a time; it is
    indicated by an "L" instead of a check mark.

During the OCR you will be prompted to enter up to 3 characters for any
unrecognized shape. If you are not sure what the unrecognized text is,
press return. It will be ignored.

One point to note about learning mode is that none of the 3 supplied fonts
includes run-ons, in other words combinations of characters which are
joined together. The reason for this is that the shapes of the joined
together characters are largely printer dependent, so what might work well
for one document would not work for another. In addition, the more shapes
that are added to the font library, the more chance there is of choosing
the wrong shape.

There are 2 uses for learning (training) mode:

  1. When there is a large amount of scanning to be done, for example a
     book, and it is worthwhile creating a special font just for this one
     document. Do not try to learn to the 3 fonts supplied; they are write
     protected. Instead, add a new font and learn to this.

  2. To learn a new font from scratch. In this case best results will be
     obtained if you supply the font in the form of an alphabet, with
     characters spaced well apart and in a large point size, say 14 or
     more. If necessary you can learn a completely new font just from the
     final copy to be processed, but this will not give the best results.
     The program will prompt less and less as it proceeds to learn the
     alphabet. It will often prompt for a character more than once. This is
     an indication of the variability of the characters scanned.
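
The way prompting tails off, and the reason a character may be prompted for
more than once, can be pictured with a toy model. The Python sketch below is
purely illustrative and is not PRO-CR<tm>'s method: it assumes each scanned
shape is reduced to a short tuple of made-up feature values, and that a shape
counts as recognized when it differs from an already learnt shape in no more
than a chosen number of positions. All names in it are hypothetical.

```python
def distance(a, b):
    """Count the positions at which two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def learn(shapes, threshold, ask):
    """Run a toy learning pass; return the library and the prompt count."""
    library = []  # learnt (features, text) pairs
    prompts = 0
    for features in shapes:
        # Find the learnt shape closest to the one just scanned, if any.
        best = min(library, key=lambda entry: distance(entry[0], features),
                   default=None)
        if best is not None and distance(best[0], features) <= threshold:
            continue  # close enough to a learnt shape: no prompt needed
        prompts += 1
        text = ask(features)
        if text:  # an empty answer (just pressing return) skips the shape
            library.append((features, text))
    return library, prompts


# Four scanned shapes: the second is a slightly noisy copy of the first.
scans = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 0, 0), (1, 0, 1, 1)]


def operator(features):
    """Stand-in for the operator typing the text for a shape."""
    return "a" if features[0] else "b"


print(learn(scans, threshold=0, ask=operator)[1])  # strict matching
print(learn(scans, threshold=1, ask=operator)[1])  # tolerant matching
```

In this toy model the strict setting prompts three times, asking again for
the noisy copy of a shape it has already learnt, while the tolerant setting
prompts only twice. A setting that is too tolerant would instead accept
shapes that have not really been learnt.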

Hints for learning:

The learning mode uses any characters you give it to try to match new
characters. In this way it should prompt you less and less as it learns the
complete alphabet, eventually prompting only for joined or broken
characters. However, you will find that on occasion it will prompt you for
characters you have already entered. This is because the recognition
threshold is a compromise between recognizing a character that has not been
learnt yet and prompting too often for characters already learnt.

------------------------------[ 6.10 EDIT MODE ]-----------------------------

Use the edit mode to consolidate your learnt font, removing unwanted
duplicate characters or runs and correcting any mistakes made when entering
the string representation for the character shape.

Do not try to enter a string for shapes that you do not recognize yourself;
just hit return to skip to the next character during a learning session. In
particular, do not enter punctuation marks; these are handled with special
algorithms.

Some characters, such as o, u, v, x etc., are ambiguous with regard to case
when viewed out of context. If you are unsure as to the case of a shape
that the program prompts you with, either skip the character by pressing
return or simply enter the lower case version. The program has special
algorithms for adjusting the case of such ambiguous characters.

After a learning session, always run the program in non-learn mode using
the new font to determine the results.

You may use one font, for example one of the supplied fonts, while learning
to a new font.

MENU 2b. (Select font to edit:)

    Select the EDIT option to select a font for editing. Editing consists
    of deleting unwanted shapes from the font or changing the text which
    they represent. Follow the directions on the edit screen.

Note that the default fonts supplied with PRO-CR<tm> are write protected,
so any attempt to learn to them or edit them will fail.

---------------------------[ 6.11 ERROR MESSAGES ]--------------------------

The following error codes relate to TIFF files.

   1 : Could not find the IMAGE.TIF input file.
   2 : Non-Intel byte order. The TIFF file is possibly a Mac file.
   3 : Wrong value for bits_per_sample tag.
   4 : Compressed TIFF file. This version does not handle compressed
       files.
   5 : Wrong value for photometric_interpretation tag.
   6 : Wrong value for fill_order tag.
   7 : Wrong picture orientation.
   8 : Wrong value for samples_per_pixel tag.
   9 : Wrong value for minimum_sample tag.
  10 : Wrong value for maximum_sample tag.
  11 : Wrong value for planar_configuration tag.
  12 : Missing bits_per_sample tag.
  13 : Missing image_width tag.
  14 : Missing image_length tag.
  15 : Missing image_pointer tag.
  16 : Missing X_resolution tag.
  17 : Missing Y_resolution tag.

-----------------------------[ 7. SITE LICENSE ]----------------------------

COMPANIES please note that only ONE USER at ONE LOCATION may use and
operate PRO-CR<tm>. Additional computers, users and locations should be
registered separately, by volume, or by obtaining a site license.

DISCOUNT RATES are offered to companies registering for a site license or
by volume. Please write to Gray Design Associates, P.O. Box 333, Northboro,
MA 01532, USA for a rate schedule.

------------------------[ 8. COMMENTS TO THE AUTHOR ]-----------------------

Any feedback would be greatly appreciated. Please direct any comments to
the author personally, via mail to David P. Gray, Gray Design Associates,
P.O. Box 333, Northboro, MA 01532, USA.

----------------[ 9. ASSOCIATION OF SHAREWARE PROFESSIONALS ]---------------

This software is produced by David P. Gray, who is a member of the
Association of Shareware Professionals (ASP). ASP wants to make sure that
the shareware principle works for you. If you are unable to resolve a
shareware-related problem with an ASP member by contacting the member
directly, ASP may be able to help.
The ASP Ombudsman can help you resolve a dispute or problem with an ASP
member, but does not provide technical support for members' products.
Please write to the ASP Ombudsman at P.O. Box 5786, Bellevue, WA 98006, USA
or send a CompuServe message via easyplex to ASP Ombudsman 70007,3536.

----------------------------[ 10. MISCELLANEOUS ]---------------------------

HP and ScanJet are registered trademarks of Hewlett Packard.

----------------------------[ END OF MANUAL.DOC ]----------------------------